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Abstract 

Covariance selection seeks to estimate a covariance matrix by maximum likelihood while 
restricting the number of nonzero inverse covariance matrix coefBcients. A single penalty pa- 
rameter usually controls the tradeoff between log likelihood and sparsity in the inverse matrix. 
We describe an efficient algorithm for computing a full regularization path of solutions to this 
problem. 

1 Introduction 

We consider the problem of estimating a covariance matrix from sample multivariate data my 
maximizing its likelihood, while penalizing the inverse covariance so that its graph is sparse. This 
problem is known as covariance selection and can be traced back at least to [1]. The coefficients 
of the inverse covariance matrix define the representation of a particular Gaussian distribution as 
a member of the exponential family, hence sparse maximum likelihood estimates of the inverse 
covariance yield sparse representations of the model in this class. Furthermore, in a Gaussian 
model, zeroes in the inverse covariance matrix correspond to conditionally independent variables, 
so this penalized maximum likelihood procedure simultaneously stabilizes estimation and isolates 
structure in the underlying graphical model (see [2]). 

Given a sample covariance matrix S G S„, the covariance selection problem is written as follows 

maximize log det X - Tr(SX) - p Card(X) 

in the matrix variable X € S„, where p > is a penalty parameter controlling sparsity and 
Card(X) is the number of nonzero elements in X. This is a combinatorially hard (non-convex) 
problem and, as in [3, 4, 5, 6], we form the following convex relaxation 

maximize log det X - Tr(SX) - p\\X\\i (1) 

which is a convex problem in the matrix variable X € S„, where ||^||i is the sum of absolute 
values of the coefficients of X here. After scaling, the ||X||i penalty can be understood as a convex 
lower bound on Card(X) and this type of relaxation has been used in variable selection and matrix 
completion for example [7, 8, 9]. Another completely different approach derived in [10] reconciles 
the local dependence structure inferred from n distinct £i-penalized regressions of a single variable 
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against all the others. Both this approach and the convex relaxation (1) have been shown to be 
consistent in [10] and [11] respectively. 

In practice however, both methods are computationally challenging when n gets large. Various 
algorithms have been employed to solve (1) with [6] using a custom interior point method and [11] 
using a block coordinate descent method where each iteration required solving a LASSO-like prob- 
lem, among others. This last method is efficiently implemented in the GLASSO package by [12] 
using coordinate descent algorithms from [13] to solve the inner regression problems. 

One key issue in all these methods is that there is no a priori obvious choice for the penalty 
parameter so, in practice, at least a partial regularization path of solutions has to be computed, 
and this procedure is then repeated many times to get confidence bounds on the graph structure 
by cross-validation. Pathwise LASSO algorithms such as LARS by [14] can be used to get a inll 
regularization path of solution using the method in [10] but this still requires solving and reconciling 
n regularization paths on regression problems of dimension n. 

Our contribution here is to formulate a pathwise algorithm for solving problem (1) using numer- 
ical continuation methods (see [15] for an application in kernel learning). Each iteration requires 
solving a large structured linear system (predictor step) then improving precision using a block 
coordinate descent method (corrector step). Overall, the cost of moving from one solution to prob- 
lem (1) to another is typically much lower than that of solving two separate instances of (1). We 
illustrate the performance of our methods on several artificial and realistic data sets. 

The paper is organized as follows. Section 2 reviews some basic convex optimization results on 
the covariance selection problem in (1). Our main pathwise algorithm is described in Section 3. 
Finally, we present some numerical results in Section 4. 

Notation. In what follows, we write the set of symmetric matrices of dimension n. For a 
matrix X G R"*^", we write its Frobenius norm, ||X||i = the £i norm of its vector 

of coefficients, and Card(X) the number of nonzero coefficient in X. 

2 Covariance Selection 

Starting from the convex relaxation defined above 



in the variable X G Sn, where \\X\\i can be understood as a convex lower bound on the Card(X) 
function whenever \Xij\ < 1 (we can always scale p otherwise). Let us write X*(p) the optimal 
solution of problem (2). In what follows, we will seek to compute (or approximate) the entire 
regularization path of solutions X*{p), for p G R+. To remove the nonsmooth penalty, we can set 
X = L — M and rewrite the problem above as 



maximize logdetX — Tr(EX) — p|l-^|]i 



(2) 



maximize log det(i: - M) - Tr(S(L - M)) - pl'^{L + M)l 
subject to Lij, Mij > 0, i,j = l,...,n. 



(3) 



in the matrix variables L,M e S. 
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2.1 Dual &c optimality conditions 

We can form the following dual to problem (2) as 

minimize — logdet(J7) — n 

subject to Uij < p + Sjj (4) 

Uij > Sjj p 

in the variable U G S„. The KKT conditions (sec [16, §5.9.2]) for problem (3) and (4) axe then 
given by 

(L - M) = U-^ 

{p + i:ij-Uij)L,j = (5) 
(p - + Uij)Mij = 

in the variable J7 G S„ and L,M G S„. 

2.2 Regularized reformulation 

While problem (4) is a convex optimization problem with O(n^) linear constraints. As in [15] 
for example, in the spirit of barrier methods for interior point algorithms, we form the following 
(unconstrained) regularized problem 

(n n \ 

log(p + - Uij) + J2 log(P - ^ij + Uij) (6) 

in the variable C/ G S„ and t > specifies a desired tradeoff level between centrality (smoothness) 
and optimality. Prom every solution U*{t) corresponding to each t > 0, the barrier formulation also 
produces an explicit dual solution {L* {t) , M* (t)) to Problem (4). Indeed we can define matrices 
A, /X G S„ as follows 

Mij(U,p) = * (8) 

P ~ ^ij i" Uij 
or alternatively, in the complementarity format 

Lij{U,p){p + ^ij-Uij)=t (9) 
Mij{U, p){p - ^ij + Uij) = t. (10) 

Pirst order optimality conditions for problem (6) then imply (L — M) = U~^. As t tends to 0, 
problem (6) traces a central path towards the optimal solution to problem (4). If we write f{U) 
the objective function of problem (4) and call p* its optimal value, we get 

f{U*{t))-p* < 2nh 

hence t can be understood as a surrogate duality gap when solving problem (4). 
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3 Algorithm 

In this section we derive a Predictor- Corrector algorithm to approximate the entire path of solutions 
X*{p) when p varies between and maxj Sjj (beyond which the solution matrix is diagonal). Define 

H{U,p) = XiU,p)-i^{U,p)-U-' 

we trace the curve H{U,p) = (first order optimality conditions for problem (6)). 

Algorithm 1 Pathwise Covariance Selection 
Input: S G Sm 

1: Start with (Uo,po) s.t H{Uo,po) = 
2: for i = 1 to do 

3: Predictor Step. Let pi+i = pi + h. Compute a tangent direction by solving the linear system 

dH , , , BU 
— {Ui,pi) + J{Ui,pi)— = 

in dU /dp G S„, where J{Ui,pi) = dH{U, p)/dU € S„2 is the Jacobian matrix of the function 
H{U,p). 

4: Update Ui+i = Ui + hdU/dp. 

5: Corrector Step. Solve problem (6) starting aX U = i7j+i. 
6: end for 

Output: Sequence of matrices Ui, i = 1, . . . ,k. 

Typically here, h is a small constant, po — maxj Sjj and Uq is computed by solving a single 
(very sparse) instance of problem (6) for example. 

3.1 Predictor: conjugate Gradient method 

In the algorithm above, the tangent direction in the predictor step is computed by solving a linear 
system Ax = b where A = {W^ + D) and is a diagonal matrix. This system of equations 
has dimension and we solve it using the conjugate gradient (CG) method. 

CG iterations. The most expensive operation in the CG iterations is the computation of a matrix 

2 

vector product Ap^, with G . Here however, we can exploit problem structure to compute 
this step very efficiently. Observe that {U~^ (8) U''^)pk = \ec{U~^ PkU~^) when pk = vec(Pfc), so 
the computation of the matrix vector product Api^ needs only 0{n^) flops instead of O(n^). The 
CG method then needs at most O(ra^) iterations to converge, leading to a total complexity of O(n^) 
for the predictor step. In practice, we will observe that CG needs considerably fewer iterations. 

Stopping criterion. To speed up the computation of the predictor step, we can stop the con- 
jugate gradient solver when the norm of the residual falls below the numerical tolerance t. In our 
experiments here, we stopped the solver after the residual decreases by two order of magnitudes. 
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Projection If wc arc moving backward on the path (i.e. from sparse sohitions to dense ones, 
which is usuaUy faster), there is no guarantee that, after taking a step in the direction given by the 
predictor, the new matrix U wih be in the domain of (6), i.e. wiU satisfy: 

U yO, Uij < p + Uij > Eij - p, i,j = l,...,n. 

However, this domain is the intersection of a box with the semidefinite cone. Projecting on a box 

has cost O(n^) and projecting on the p.s.d. cone requires a single eigenvalue decomposition with 
complexity O(n^). This means that we can find the closest matrix U using alternating projections 
(see [17] for details). 

3.2 Corrector: block coordinate descent 

For small-sized problems, we can use Newton's method to solve the Problem (6). However from 
a computational perspective, this approach is not practical for large values of n. We can simplify 
iterations using a block coordinate descent algorithm that updates one row/column of the matrix 
in each iteration ([11]). Let us partition the matrices U and S as 

--(ir :) ^-{tr I) 

We keep V fixed in each iteration and solve for u, w. Without loss of generality, we can always 
assume that we are updating the last row/column. 

Algorithm. Problem (6) can be written in block format as: 

minimize — log{w — u^V~^u) — i(log(p + c — 1) + log(p — c + w)) 

(11) 

-2t (Ei log(p + bi- Ui) + J2i log(p -bi + Ui)) 

in the variables u G R^**"^) and w & R. Here V G S*^""^'' is kept fixed in each iteration. The 
above optimization unconstrained optimization problem can be solved for larger values of n using 
coordinate descent methods for example (as in [13]). We use Newton's method here because it has 
explicit complexity bounds and works well on medium scale problems. We summarize the block 
coordinate algorithm as follows. 

Algorithm 2 Block coordinate descent corrector steps 
Input: Uo, S G S„ 
1: for i = 1 to do 

2: Pick the row and column to update. 
3: Use Newton's method to solve Problem (11). 
4: Update Lf-^. 
5: end for 
Output: A matrix Uk solving (6). 



We can use the Sherman- Woodbury- Morrison (SWM) formula (see [16, §C.4.3]) to efficiently 
update at each iteration, so it suffices to compute the full inverse only once at the beginning of 
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the path. Assume that the last row/column of the block matrix Uk defined above is being updated 
and let {u*,w*) be the solution to problem (11). If we define the vectors 



the update step in U can be written 

C/fe+i = Uk + re^ + e^s, 

which means that {Uk+i)~^ can be computed using the SWM formula on this rank-two update 
of Uk- This step only requires a few matrix vector products, hence can be computed efficiently. It 
then remains to update using the SWM once again, this time on 

(o l) " (( u; ) ~ ( tx^ 
which is a rank two update of {Uk+i)~^- 

Dual block problem. We can derive a dual to problem (11) by rewriting it as a constrained 
optimization problem to get 

minimize - log xi - t{\og X2 + log X3) - 2t (X^i(log yi + log Zi)) 

subject to xi < w — u'^V~^u , . 

X2 = p + c — w, X3 = p — c + w 

yi = p + bi - Ui, Zi = p-bi + Ui 

in the variables u G I{.^'^~^\w e R, x G R^,y G R^"'"^'',^; G R*^""^''. We can write the Lagrangian 
of this problem as 

L{x,y,z;a,(3,r]) = -logxi - t(logX2 + logxs) - 2t (^^{log yi + log Zi)) 

+ai{xi — w + u^V~^u) + a2{x2 — p — c + w) + as{x2 — p + c — w) (13) 
+ Ei {Pi{yi - p-bi + Ui) + r}i{zi - p + bi- Ui)) 



in the dual variables a G R^,/3 G R*^** and rj G R^" ^\ This Lagrangian is minimized at the 
following point 

^* _ n \ „.* — V{v-P) 

(14) 



x* = {l/ai,t/a2,t/a3), u* = ^2af'' 



y* = 2t/Pi, z* = 2t/vi. 
We can then write the dual to problem (12) as 

maximize 1 + 2t(2n - 1) + log ai - a2{p + c) - as^p - c) -J2i {Pi{p + h) + fli{p- h)) 
+t log{a2/t) + t log(a3 A) + 2t (Ei (log(A/2t) + log(r?i/2t))) 

subject to ai = a2 — as 
CKl > 

in the variables a G R^,/? G R^""^) and r] G R("~^). 
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Stopping criterion. Wc can use this last program to bound the suboptimahty of the current 
iterate when solving (11). Let us write fo{x, y, z, u, w) and g{a, /?, rj) objective functions of problems 
(11) and (15) respectively. Starting from the primal point {x,y, z,u,w) we can form a candidate 
dual solution using (14) as 



and the surrogate duality gap fQ(x,y, z,u,w) — g{a,(],r]) will be an upper bound on the subopti- 
mahty of the current point {u,w) in Algorithm (2). This allows us to reliably stop Algorithm (2) 
when a given target precision is reached. 

3.3 Complexity 

The complexity of moving from a solution X(p) of (2) to a solution X(p + h) can be summarized as 
follows. Solving for the predictor step as in §3.1 requires O(n^) matrix products, but the number of 
iterations necessary to get a good estimate of the predictor is typically much lower (cf. experiments 
in the next section). The corrector steps require solving 0(n) instances of the block problem (11), 
with each problem having a complexity of 0(n^.5) (both coordinate descent and first-order methods 
would be much faster here in practice, but explicit complexity bounds are not available for CD). 
On the other hand, doing a predictor step means that the initial point is very close to the optimal 
solution of (11), hence the number of iterations required by Newton's method is typically lower. 

4 Numerical Results 

In this section, we start by illustrating our problem on a simple example. We then discuss its 
practical use and performance on larger, more realistic data sets. 

4.1 Illustration 

To first illustrate our method on an intuitive data set, we solve for a full regularization path of 
solutions to problem (2) on the covariance matrix of U.S. forward rates for maturities ranging from 
six months to ten years from 1998 until 2005. Forward rates move as a curve, so we expect their 
inverse covariance matrix to be close to band diagonal. Figure 1 shows the dependence network 
obtained from the solution of problem (2) on this matrix along a path, for p = .02, p = .008 
and p = .006. The graph layout was formed using the yFilcs-Organic option in Cytoscape. The 
string-like dynamics of the rates clearly appear in the last plot. 

In Figure 2, we sample from Gaussian model with a given sparse inverse covariance matrix 
(constructed from the sparse forward rate model inferred in the previous example) and (on the left) 
plot area under ROC curve versus number of points used in forming the sample covariance matrix, 
for both covariance selection and a simple thresholding of the inverse matrix coefficients. On the 
right, we also plot area under ROC curve versus p, solving problem (2) with S formed using 20 
samples. Layout was performed using the yFiles- Organic option in cytoscape. 



a2 = t/x2,as 
(3i = 2t/yi, rji 



t/x3,ai = OL2- as, 

ItjZi 
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Figure 1: Three sample dependence graphs corresponding to the solution of problem (2) on a U.S. 
forward rates covariance matrix for p = .02 (left), p = .008 (center) and p = .006 (right). 




# samples P 
Figure 2: Left: Area under ROC curve versus number of sample points used in forming the co- 
variance matrix S for covariance selection with fixed p (circles) and simple thresholding (squares). 
Right: Area under ROC curve versus penalty parameter p for covariance selection when S is gen- 
erated using 20 sample points. The dashed line is the AUROC reached by thresholding the inverse 
matrix coefficients. 



4.2 Experiments 

Here, we first discuss how thresholding is used to bring exact zeros in approximate solutions on the 
regularization path. We also describe how to produce a confidence measure on graph edges. We 
conclude by applying the techniques to a topic model of the journal Science, produced by [18]. 

Thresholding Because we only produce a path of approximate solutions to problem (2), with 
relative duality gap typically of the order 10~^, it is then necessary to threshold the coefficients 
of Once simple heuristic is to prune out all the coefficients of that are smaller than 

maxjj by at least an order of magnitude. Standard matrix perturbation results (see [19] for 
example) then ensure that the effect on the spectrum of X is negligible. 
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Cross validation Leave-one-out cross-validation to get a confidence measure on the graph edges: 
we can remove one variable at random from the data set, compute the dependence graph corre- 
sponding to the remaining variables and repeat this experiment a number of times. Confidence in 
a graph edge is then measured by the frequency with which it appeared in the random subgraphs. 
Perhaps more importantly, this same technique can be used to select the penalty parameter p in 
(2), by simply picking the value of p for which overall graph stability is maximum. 

Performance To generate test matrices here, we sample sparse random matrices and add multi- 
ples of the identity to make them positive semidefinite, then use the inverse matrix as our sample 
matrix E. In Figure 3, wc plot the number of nonzero coefficients (cardinality) in the inverse co- 
variance versus penalty parameter p, along a path of solutions to problem (2). We then plot the 
number of conjugate gradient iterations required to compute the predictor in §3.1 versus number of 
nonzero coefficients in the inverse covariance matrix. We notice that the number of CG iterations 
decreases significantly for sparse matrices, which makes the algorithm faster at the sparse (i.e. 
interesting) end of the regularization path. 

We also compare numerical performance of three methods for computing a full regularization 
path of solutions to problem (2). We first compute a full path of solutions solving separate instances 
of problem (2) using the block coordinate code in [4], we then compute the same path of solutions 
using the methods detailed here and finally repeat this experiment using the Glasso path code [12] 
which restarts the covariance selection problem at p + e at the current solution of (2) obtained 
at p. Both COVSEL and our (prototype) code here are written in MATLAB (excepts for a few 
steps in C), while Glasso is compiled from Fortran and interfaced with R (hence much faster). We 
report CPU time (in seconds) versus problem dimension in Table 1. Note that the predictor step 
in §3.1 could also be used to speed up Glasso, but interfacing issues make this option slower at this 
point. Unfortunately, Glasso does not use the duality gap as a stopping criterion but rather lack 
of progress (average absolute parameter change less than 10~^), it also performs some thresholding 
before returning a solution, so these direct comparisons are somewhat noisy. 



Table 1: CPU time (in seconds) versus problem dimension for computing a regularization path for 
200 values of the penalty p, solving separate covariance selection problems using block coordinate 
descent (Covsel) , the path following method detailed here (Covsel path) and the Glasso code (Glasso 
path). 



Application To illustrate our results on a more realistic data set, we applied our algorithm to 
the covariance matrix extracted from the Correlated Topic Model calibrated in [18] on 10 years of 
art icles from the journal Science, targeting a graph density low enough to reveal some structure. 
The corresponding network is detailed in Figure 4. Graph edge color is related to the sign of the 
conditional correlation (green for positive, red for negative) while edge thickness is proportional to 
the correlation magnitude. The five most important words are listed for each topic. 



Dimension 



Covsel Covsel path Glasso path 



50 
100 



235 11 8 

843 78 25 
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P Cardinality 

Figure 3: Left: We plot the number of nonzero coefficients in the inverse covariance versus penalty 
parameter p, along a path of solutions to problem (2). Right: Number of conjugate gradient 
iterations required to compute the predictor step versus number of nonzero coefficients in the 
inverse covariance matrix. 
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Figure 4: Topic network for the Science Correlated Topic Model in [18]. Network layout using 
cytoscape. Graph edge color is related to the sign of the conditional correlation (green for positive, 
red for negative) while edge thickness is proportional to the correlation magnitude. 
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